The experimental ideal allows us to assume this:
\[Y_i(0), Y_i(1), X_i \mathrel{\unicode{x2AEB}} D_i\]

But we can still recover the average treatment effect under a weaker assumption: that assignment is independent of the potential outcomes conditional on covariates.
\[Y_i(0), Y_i(1) \mathrel{\unicode{x2AEB}} D_i | X_i\]
Regression controls attempt to meet this more relaxed condition, but methods like matching and subclassification may be able to do so more flexibly.
Death rates per 1,000 person-years (Cochran 1968):

| Smoking group | Canada | UK | US |
|---|---|---|---|
| Non-smokers | 20.2 | 11.3 | 13.5 |
| Cigarettes | 20.5 | 14.1 | 13.5 |
| Cigars/pipes | 35.5 | 20.7 | 17.4 |
Mean ages, years (Cochran 1968):

| Smoking group | Canada | UK | US |
|---|---|---|---|
| Non-smokers | 54.9 | 49.1 | 57.0 |
| Cigarettes | 50.5 | 49.8 | 53.2 |
| Cigars/pipes | 65.9 | 55.7 | 59.7 |
| Age group | Death rates | # of Cigarette smokers | # of Pipe or cigar smokers |
|---|---|---|---|
| Age 20–40 | 20 | 65 | 10 |
| Age 41–70 | 40 | 25 | 25 |
| Age >71 | 60 | 10 | 65 |
| Total | | 100 | 100 |
The un-adjusted death rate for cigarette smokers is:
\[ 20 \times \frac{65}{100} + 40 \times \frac{25}{100} + 60 \times \frac{10}{100} = 29 \]
Their age-adjusted death rate, re-weighted to the pipe/cigar smokers’ age distribution, is:
\[ 20 \times \frac{10}{100} + 40 \times \frac{25}{100} + 60 \times \frac{65}{100} = 51 \]
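The same arithmetic in R, just re-weighting the stratum-level rates (numbers taken from the hypothetical table above):

```r
# Re-computing the subclassification arithmetic above
death_rate  <- c(20, 40, 60)   # death rate in each age stratum
n_cigarette <- c(65, 25, 10)   # cigarette smokers per stratum
n_pipe      <- c(10, 25, 65)   # pipe/cigar smokers per stratum

# weight the stratum-level rates by each group's own age distribution
unadjusted <- sum(death_rate * n_cigarette) / sum(n_cigarette)   # 29
# re-weight cigarette smokers' rates to the pipe/cigar age distribution
adjusted   <- sum(death_rate * n_pipe) / sum(n_pipe)             # 51
```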
| Age and Gender | Death rates | # of Cigarette smokers | # of Pipe or cigar smokers |
|---|---|---|---|
| Age 20 M | 0 | 10 | - |
| Age 20 F | 1 | - | 1 |
| Age 21 M | 3 | 1 | - |
| Age 21 F | 0 | 7 | 2 |
| Age 22 M | 0 | - | 2 |
| Age 22 F | 1 | 9 | 1 |
| … | … | … | … |
We have no way to “re-weight” the death rates at some covariate values, because one of the groups has no observations there.
Methods like subclassification only work when there’s sufficient overlap between treated and control units. This is sometimes referred to as the “area of common support” assumption.
| Age and Gender | Death rates | # of Cigarette smokers | # of Pipe or cigar smokers |
|---|---|---|---|
| Age 20 M | 0 | 10 | 1 |
| Age 20 F | 1 | - | 1 |
| Age 21 M | 3 | 1 | 1 |
| Age 21 F | 0 | 7 | 2 |
| Age 22 M | 0 | - | 2 |
| Age 22 F | 1 | 9 | 1 |
| … | … | … | … |
Note that, in some settings, the ATT is the more sensible estimand anyway: not everyone would ever take the “treatment”!
Matching is a related method for addressing confounding. In the simplest matching designs, we try to identify control units with similar values of all confounders for each treated unit, then compare outcomes.
\[ \text{ATT} = \frac{1}{N_T}\sum_{D_i=1} \left(Y_i - Y_{j(i)}\right) \] (where \(Y_{j(i)}\) is the matched control unit for treated unit \(i\))
If the conditional independence assumption holds, then the average difference between the treated unit and its matched control should be equal to the difference in potential outcomes for the treated units.
In cases where we have more than one match for a single treatment unit, we can average over the matches:
\[ \text{ATT} = \frac{1}{N_T}\sum_{D_i=1} \left(Y_i - \frac{1}{M} \sum^M_{m=1} Y_{j_m(i)} \right) \] …where \(M\) is the number of matched units for treated unit \(i\).
Moreover, we can still estimate the ATE with this approach by finding counterfactuals for all treated and control units.
\[ \text{ATE} = \frac{1}{N} \sum_{i=1}^N (2D_i - 1) \left[ Y_i - \frac{1}{M} \sum_{m=1}^M Y_{j_m(i)} \right] \]
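As a minimal sketch of the ATT estimator above in base R, with simulated toy data and nearest-available matching on a single confounder (all names here are illustrative):

```r
# A hand-rolled sketch of the one-to-one matching estimator
# (toy data; in practice a package like MatchIt handles this)
set.seed(1)
n <- 200
x <- rnorm(n)                        # a single confounder
d <- rbinom(n, 1, plogis(x))         # treatment depends on x
y <- 1.5 * d + x + rnorm(n)          # true treatment effect = 1.5

treated <- which(d == 1)
control <- which(d == 0)

# for each treated unit i, j(i) is the control with the closest x
j <- sapply(treated, function(i) control[which.min(abs(x[control] - x[i]))])

att_hat <- mean(y[treated] - y[j])   # ATT estimate
```

Note that controls can be reused here, i.e. this is matching with replacement.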
…So how do we identify matching units? Exact matching would be ideal, but it is usually infeasible because of the common support requirement (especially when we want to adjust for continuous confounders like age).
But we can use various methods to find similar units instead of exact matches.
For instance, we might use the Euclidean distance to compare vectors of covariates, then find the nearest match for each “point” (this is called nearest neighbor matching)
(Many nearest-neighbor matching studies will use the Mahalanobis distance instead, which is a metric that accounts for correlations between covariates)
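For intuition, base R’s `stats::mahalanobis()` computes these (squared) distances directly; a small sketch with toy covariates (all values illustrative):

```r
# Find the control unit closest to one treated unit under the
# Mahalanobis distance, which scales by the covariance of the covariates
set.seed(2)
X_control <- cbind(age = rnorm(50, 40, 10), income = rnorm(50, 30, 5))
x_treated <- c(age = 38, income = 31)   # one treated unit's covariates

S  <- cov(X_control)                    # covariance matrix of the covariates
d2 <- mahalanobis(X_control, center = x_treated, cov = S)  # squared distances
best_match <- which.min(d2)             # index of the nearest control
```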
The variations here are practically endless:
For nearest neighbor matching, we might increase or decrease the number of matches permitted for each unit.
We might also limit the distance between matched units so that, if nothing is close enough, we discard that match entirely.
We might pair an approximate matching method with additional weighting methods to account for dissimilarities between pairs.
Instead of matching on values themselves, propensity score methods either re-weight or match observations based on a predicted propensity to receive treatment:
From: https://www.andrewheiss.com/blog/2024/03/21/demystifying-ate-att-atu/
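For the re-weighting variant, a minimal base-R sketch of inverse propensity weighting on toy data (all names illustrative):

```r
# Inverse propensity weighting: re-weight observations by the inverse
# of their estimated probability of the treatment status they received
set.seed(3)
n <- 5000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(x))                  # treatment depends on x
y <- 2 * d + x + rnorm(n)                     # true ATE = 2

ps <- fitted(glm(d ~ x, family = binomial))   # estimated propensity scores
w  <- ifelse(d == 1, 1 / ps, 1 / (1 - ps))    # ATE weights
ate_hat <- weighted.mean(y[d == 1], w[d == 1]) -
  weighted.mean(y[d == 0], w[d == 0])
```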
In practice, propensity scores often work well, but in theory they can actually worsen bias if the treatment-selection process isn’t modeled correctly. Other methods are more robust to this kind of misspecification.
Coarsened exact matching works by coarsening each confounder into a small number of discrete bins, matching exactly on the binned values, and then weighting observations within each resulting stratum.
| | (1) |
|---|---|
| (Intercept) | 4554.802 *** |
| (408.046) | |
| treat | 1794.343 ** |
| (632.854) | |
| N | 445 |
| R2 | 0.018 |
| logLik | -4542.741 |
| AIC | 9091.482 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
| | vs experimental control | vs population |
|---|---|---|
| (Intercept) | 4554.802 *** | 6984.170 *** |
| (408.046) | (360.710) | |
| treat | 1794.343 ** | -635.026 |
| (632.854) | (657.137) | |
| N | 445 | 614 |
| R2 | 0.018 | 0.002 |
| logLik | -4542.741 | -6346.371 |
| AIC | 9091.482 | 12698.742 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | ||
(He tried many different adjustment strategies; none of them recovered the experimental benchmark.)
Candidates for job training were unlike the general population. They weren’t even especially similar to people with similar income, education, age, race, etc.
The treatment group differed from the general population in their expected potential income. So a naive difference of means would be a combination of the impact of job training and the difference in potential outcomes with the general population.
\[\underbrace{E[Y_i(1) - Y_i(0)|D_i=1]}_\text{job training effect} + \underbrace{E[Y_i(0)|D_i=1] - E[Y_i(0)|D_i=0]}_\text{differences between applicants and non-applicants}\]
| Variable | N | Population, N = 429¹ | Control, N = 260¹ | Treated, N = 185¹ |
|---|---|---|---|---|
| age | 874 | 25 (19, 35) | 24 (19, 28) | 25 (20, 29) |
| educ | 874 | 11 (9, 12) | 10 (9, 11) | 11 (9, 12) |
| race | 874 | | | |
| black | | 87 (20%) | 215 (83%) | 156 (84%) |
| hispan | | 61 (14%) | 28 (11%) | 11 (5.9%) |
| white | | 281 (66%) | 17 (6.5%) | 18 (9.7%) |
| married | 874 | 220 (51%) | 40 (15%) | 35 (19%) |
| nodegree | 874 | 256 (60%) | 217 (83%) | 131 (71%) |
| income 1974 | 874 | 2,547 (0, 9,277) | 0 (0, 279) | 0 (0, 1,291) |
| income 1975 | 874 | 1,087 (0, 3,881) | 0 (0, 655) | 0 (0, 1,817) |

¹ Median (Q1, Q3); n (%)
Control variables probably can’t fix this.
Applied for program \(\rightarrow\) Earnings
Applied for program \(\leftarrow\) Education \(\rightarrow\) Earnings
```r
dataFunction <- function() {
  N <- 1000
  education <- rpois(N, 3)                     # confounder: education
  D <- rbinom(N, 1, arm::invlogit(education))  # applied to job program
  Y_1 <- 2 + education^2 + rnorm(N)            # potential outcome with training
  Y_0 <- 0 + education^2 + rnorm(N)            # potential outcome without training
  Y <- ifelse(D == 1, Y_1, Y_0)                # observed outcomes
  df <- tibble(income = Y, Y_0 = Y_0, Y_1 = Y_1,
               treated = factor(D, labels = c("Control", "Treated")),
               education)
  return(df)
}

set.seed(9999)
data <- dataFunction()

ggplot(data, aes(x = income, y = treated)) +
  stat_halfeye() +
  theme_bw() +
  xlab("Income")
```

Notably, since the education confounder has a non-linear effect on income, we don’t get a correct estimate even if we control for it:
| | no controls | ols + control |
|---|---|---|
| actual ATE = 1.871 | ||
| (Intercept) | 2.721 | -5.448 |
| (1.187) | (0.411) | |
| treatedTreated | 12.322 | -2.020 |
| (1.259) | (0.455) | |
| education | 7.084 | |
| (0.080) | ||
| Num.Obs. | 1000 | 1000 |
Exact matching on education gets us close to the correct answer:
```r
ate <- mean(data$Y_1 - data$Y_0)  # the true ATE in the simulated data

matched <- matchit(treated ~ education, data = data,
                   method = "exact",
                   estimand = "ATE")
mdat <- match.data(matched)

exact_match_fit <- lm(income ~ treated + education,
                      data = mdat, weights = weights)

modelsummary(list("no controls" = ols,            # ols and ols_control were fit earlier
                  "ols + control" = ols_control,
                  "exact matching" = exact_match_fit),
             gof_map = c("nobs"),
             note = sprintf("actual ATE = %s", round(ate, digits = 3)))
```

| | no controls | ols + control | exact matching |
|---|---|---|---|
| actual ATE = 1.871 | |||
| (Intercept) | 2.721 | -5.448 | -5.446 |
| (1.187) | (0.411) | (0.335) | |
| treatedTreated | 12.322 | -2.020 | 1.879 |
| (1.259) | (0.455) | (0.304) | |
| education | 7.084 | 5.546 | |
| (0.080) | (0.071) | ||
| Num.Obs. | 1000 | 1000 | 862 |
In repeated simulations, the exact matching estimates are unbiased and have only slightly more variance than the actual average treatment effect:
```r
set.seed(999)
fun <- function() {
  df <- dataFunction()               # re-using the data-generating function
  ATE <- mean(df$Y_1 - df$Y_0)
  model <- lm(income ~ education + treated, data = df)
  matched <- matchit(treated ~ education, data = df,
                     method = "exact", estimand = "ATE")
  mdat <- match.data(matched)
  fit1 <- lm(income ~ treated, data = mdat, weights = weights)
  res <- tibble("regression estimate" = coef(model)[3],
                "matched estimate" = coef(fit1)[2],
                "Average Treatment Effect" = ATE)
  return(res)
}

results <- replicate(500, fun(), simplify = FALSE) |>
  bind_rows() |>
  pivot_longer(cols = everything()) |>
  mutate(value = unlist(value))
```

We can also match on an estimated propensity score rather than on the covariate values themselves:

```r
matched <- matchit(treated ~ education, data = data,
                   method = "nearest",  # nearest-neighbor matching
                   replace = TRUE,      # controls can be matched more than once
                   distance = "glm")    # propensity scores from logistic regression
mdat <- match.data(matched)

psm_fit <- lm(income ~ treated + education, data = mdat, weights = weights)
```
```r
modelsummary(list("ols + control" = ols_control,
                  "exact matching" = exact_match_fit,
                  "psm" = psm_fit),
             gof_map = c("nobs"))
```

| | ols + control | exact matching | psm |
|---|---|---|---|
| (Intercept) | -5.448 | -5.446 | -9.452 |
| (0.411) | (0.335) | (0.537) | |
| treatedTreated | -2.020 | 1.879 | 1.479 |
| (0.455) | (0.304) | (0.495) | |
| education | 7.084 | 5.546 | 7.243 |
| (0.080) | (0.071) | (0.077) | |
| Num.Obs. | 1000 | 862 | 965 |
```r
matched <- matchit(treated ~ education, data = data,
                   method = "cem",      # coarsened exact matching
                   estimand = "ATE")
mdat <- match.data(matched)

cem_fit <- lm(income ~ treated + education, data = mdat, weights = weights)

modelsummary(list("ols + control" = ols_control,
                  "exact matching" = exact_match_fit,
                  "psm" = psm_fit,
                  "cem" = cem_fit),
             gof_map = c("nobs"),
             note = sprintf("actual ATE = %s", round(ate, digits = 3)))
```

| | ols + control | exact matching | psm | cem |
|---|---|---|---|---|
| actual ATE = 1.871 | ||||
| (Intercept) | -5.448 | -5.446 | -9.452 | -5.446 |
| (0.411) | (0.335) | (0.537) | (0.335) | |
| treatedTreated | -2.020 | 1.879 | 1.479 | 1.879 |
| (0.455) | (0.304) | (0.495) | (0.304) | |
| education | 7.084 | 5.546 | 7.243 | 5.546 |
| (0.080) | (0.071) | (0.077) | (0.071) | |
| Num.Obs. | 1000 | 862 | 965 | 862 |
Methods like CEM group observations into strata before weighting, so we would expect errors to cluster within strata. But we know how to deal with this: cluster the standard errors by stratum!
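For instance, the estimate below could come from a call along these lines (a sketch, assuming the `cem_fit` and `mdat` objects from the CEM step and the `marginaleffects` package; the exact original call is not shown):

```r
library(marginaleffects)

avg_comparisons(cem_fit,
                variables = "treated",   # contrast Treated vs Control
                vcov = ~subclass,        # cluster SEs by matched stratum
                newdata = mdat,
                wts = "weights")         # matching weights from match.data()
```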
```
 Estimate 2.5 % 97.5 %
     1.88  1.28   2.48

Term: treated
Type: response
Comparison: Treated - Control
```
Matching can address some of the limitations of regression models and simulate experimental conditions.
Main advantages over regression are: